Despite significant progress in object categorization, in recent years, a number of important challenges remain; mainly, the ability to learn from limited labeled data and to recognize object classes within large, potentially open, set of labels. Zero-shot learning is one way of addressing these challenges, but it has only been shown to work with limited sized class vocabularies and typically requires separation between supervised and unsupervised classes, allowing former to inform the latter but not vice versa. We propose the notion of vocabulary-informed learning to alleviate the above mentioned challenges and address problems of supervised, zero-shot, generalized zero-shot and open set recognition using a unified framework. Specifically, we propose a weighted maximum margin framework for semantic manifold-based recognition that incorporates distance constraints from (both supervised and unsupervised) vocabulary atoms. Distance constraints ensure that labeled samples are projected closer to their correct prototypes, in the embedding space, than to others. We illustrate that resulting model shows improvements in supervised, zero-shot, generalized zero-shot, and large open set recognition, with up to 310K class vocabulary on Animal with Attributes and ImageNet datasets.
translated by 谷歌翻译
Recent advances in pixel-level tasks (e.g., segmentation) illustrate the benefit of long-range interactions between aggregated region-based representations that can enhance local features. However, such pixel-to-region associations and the resulting representation, which often take the form of attention, cannot model the underlying semantic structure of the scene (e.g., individual objects and, by extension, their interactions). In this work, we take a step toward addressing this limitation. Specifically, we propose an architecture where we learn to project image features into latent region representations and perform global reasoning across them, using a transformer, to produce contextualized and scene-consistent representations that are then fused with original pixel-level features. Our design enables the latent regions to represent semantically meaningful concepts, by ensuring that activated regions are spatially disjoint and unions of such regions correspond to connected object segments. The resulting semantic global reasoning (SGR) is end-to-end trainable and can be combined with any semantic segmentation framework and backbone. Combining SGR with DeepLabV3 results in a semantic segmentation performance that is competitive to the state-of-the-art, while resulting in more semantically interpretable and diverse region representations, which we show can effectively transfer to detection and instance segmentation. Further, we propose a new metric that allows us to measure the semantics of representations at both the object class and instance level.
translated by 谷歌翻译
Neural architectures can be naturally viewed as computational graphs. Motivated by this perspective, we, in this paper, study neural architecture search (NAS) through the lens of learning random graph models. In contrast to existing NAS methods which largely focus on searching for a single best architecture, i.e, point estimation, we propose GraphPNAS a deep graph generative model that learns a distribution of well-performing architectures. Relying on graph neural networks (GNNs), our GraphPNAS can better capture topologies of good neural architectures and relations between operators therein. Moreover, our graph generator leads to a learnable probabilistic search method that is more flexible and efficient than the commonly used RNN generator and random search methods. Finally, we learn our generator via an efficient reinforcement learning formulation for NAS. To assess the effectiveness of our GraphPNAS, we conduct extensive experiments on three search spaces, including the challenging RandWire on TinyImageNet, ENAS on CIFAR10, and NAS-Bench-101/201. The complexity of RandWire is significantly larger than other search spaces in the literature. We show that our proposed graph generator consistently outperforms RNN-based one and achieves better or comparable performances than state-of-the-art NAS methods.
translated by 谷歌翻译
There has been a recent explosion of impressive generative models that can produce high quality images (or videos) conditioned on text descriptions. However, all such approaches rely on conditional sentences that contain unambiguous descriptions of scenes and main actors in them. Therefore employing such models for more complex task of story visualization, where naturally references and co-references exist, and one requires to reason about when to maintain consistency of actors and backgrounds across frames/scenes, and when not to, based on story progression, remains a challenge. In this work, we address the aforementioned challenges and propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context across the generated frames. Sentence-conditioned soft attention over the memories enables effective reference resolution and learns to maintain scene and actor consistency when needed. To validate the effectiveness of our approach, we extend the MUGEN dataset and introduce additional characters, backgrounds and referencing in multi-sentence storylines. Our experiments for story generation on the MUGEN, the PororoSV and the FlintstonesSV dataset show that our method not only outperforms prior state-of-the-art in generating frames with high visual quality, which are consistent with the story, but also models appropriate correspondences between the characters and the background.
translated by 谷歌翻译
场景图生成的任务需要在给定图像(或视频)中识别对象实体及其相应的交互谓词。由于组合较大的解决方案空间,现有的场景图生成方法假设关节分布的某些分解以使估计可行(例如,假设对象在有条件地与谓词预测无关)。但是,在所有情况下,这种固定的分解并不是理想的(例如,对于相互作用中需要的对象很小且本身不可辨别的图像)。在这项工作中,我们建议使用马尔可夫随机字段中传递消息,提出一个针对场景图生成的新颖框架,并在图像上引入动态调节。这是作为迭代改进过程实现的,其中每个修改都在上一个迭代中生成的图上进行条件。跨改进步骤的这种条件允许对实体和关系进行联合推理。该框架是通过基于小说和端到端的可训练变压器建筑实现的。此外,建议的框架可以改善现有的方法性能。通过有关视觉基因组和动作基因组基准数据集的广泛实验,我们在场景图生成上显示了改善的性能。
translated by 谷歌翻译
自从神经辐射场(NERF)出现以来,神经渲染引起了极大的关注,并且已经大大推动了新型视图合成的最新作品。最近的重点是在模型上过度适合单个场景,以及学习模型的一些尝试,这些模型可以综合看不见的场景的新型视图,主要包括将深度卷积特征与类似NERF的模型组合在一起。我们提出了一个不同的范式,不需要深层特征,也不需要类似NERF的体积渲染。我们的方法能够直接从现场采样的贴片集中直接预测目标射线的颜色。我们首先利用表现几何形状沿着每个参考视图的异性线提取斑块。每个贴片线性地投影到1D特征向量和一系列变压器处理集合中。对于位置编码,我们像在光场表示中一样对射线进行参数化,并且至关重要的差异是坐标是相对于目标射线的规范化的,这使我们的方法与参考帧无关并改善了概括。我们表明,即使接受比先前的工作要少得多的数据训练,我们的方法在新颖的综合综合方面都超出了最新的视图综合。
translated by 谷歌翻译
现代深度学习需要大规模广泛标记的数据集进行培训。少量学习旨在通过有效地从少数标记的例子中学习来缓解这个问题。在先前提出的少量视觉分类器中,假设对分类器决定的特征歧管具有不相关的特征尺寸和均匀特征方差。在这项工作中,我们专注于通过提出以低标签制度运行的差异敏感的模型来解决这一假设引起的限制。第一种方法简单的CNAP,采用基于分层正规的Mahalanobis距离基于距离的分类器,与现有神经自适应特征提取器的状态相结合,以在元数据集,迷你成像和分层图像基准基准上实现强大性能。我们进一步将这种方法扩展到转换学习设置,提出转导压盖。这种转换方法将软k-means参数细化过程与两步任务编码器相结合,以实现使用未标记数据的改进的测试时间分类精度。转导CNAP在元数据集上实现了最先进的性能。最后,我们探讨了我们的方法(简单和转换)的使用“开箱即用”持续和积极的学习。大规模基准的广泛实验表明了这一点的鲁棒性和多功能性,相对说话,简单的模型。所有培训的模型检查点和相应的源代码都已公开可用。
translated by 谷歌翻译
新型视图综合的古典光场渲染可以准确地再现视图依赖性效果,例如反射,折射和半透明,但需要一个致密的视图采样的场景。基于几何重建的方法只需要稀疏的视图,但不能准确地模拟非兰伯语的效果。我们介绍了一个模型,它结合了强度并减轻了这两个方向的局限性。通过在光场的四维表示上操作,我们的模型学会准确表示依赖视图效果。通过在训练和推理期间强制执行几何约束,从稀疏的视图集中毫无屏蔽地学习场景几何。具体地,我们介绍了一种基于两级变压器的模型,首先沿着ePipoll线汇总特征,然后沿参考视图聚合特征以产生目标射线的颜色。我们的模型在多个前进和360 {\ DEG}数据集中优于最先进的,具有较大的差别依赖变化的场景更大的边缘。
translated by 谷歌翻译
我们引入分层可控的视频生成,在没有任何监督的情况下,将视频的初始帧分解为前景和背景层,用户可以通过简单地操纵前景掩模来控制视频生成过程。关键挑战是无监督的前景背景分离,这是模糊的,并且能够预测用户操作,可以访问未获得原始视频序列。我们通过提出两阶段学习程序来解决这些挑战。在第一阶段,随着丰富的损失和动态前景大小,我们学习如何将帧分离为前景和背景图层,并在这些图层上调节,如何使用VQ-VAE发生器生成下一帧。在第二阶段,我们通过将(参数化)控制从未来框架拟合(参数化)控制来进行该网络来预测对掩码的编辑。我们展示了该学习的有效性和更粒度的控制机制,同时说明了在两个基准数据集上的最先进的性能。我们提供了一个视频摘要以及HTTPS://gabriel-中的视频结果.Github.io/layered_controllable_video_generation
translated by 谷歌翻译
Action recognition models have achieved impressive results by incorporating scene-level annotations, such as objects, their relations, 3D structure, and more. However, obtaining annotations of scene structure for videos requires a significant amount of effort to gather and annotate, making these methods expensive to train. In contrast, synthetic datasets generated by graphics engines provide powerful alternatives for generating scene-level annotations across multiple tasks. In this work, we propose an approach to leverage synthetic scene data for improving video understanding. We present a multi-task prompt learning approach for video transformers, where a shared video transformer backbone is enhanced by a small set of specialized parameters for each task. Specifically, we add a set of ``task prompts'', each corresponding to a different task, and let each prompt predict task-related annotations. This design allows the model to capture information shared among synthetic scene tasks as well as information shared between synthetic scene tasks and a real video downstream task throughout the entire network. We refer to this approach as ``Promptonomy'', since the prompts model a task-related structure. We propose the PromptonomyViT model (PViT), a video transformer that incorporates various types of scene-level information from synthetic data using the ``Promptonomy'' approach. PViT shows strong performance improvements on multiple video understanding tasks and datasets.
translated by 谷歌翻译